This analysis was initiated by Urška Sršen, co-founder of Bellabeat. Sršen knows that an analysis of available data on consumers’ smart device usage would reveal opportunities for the company to grow, and has provided the following business task:
Analyse smart device usage data to gain insight into how people are already using smart devices, then generate high-level recommendations for how these insights can inform the marketing strategy for one Bellabeat product.
For this analysis, I chose to produce recommendations for the Bellabeat Ivy. At the time of writing, this was Bellabeat’s most fully-featured product and the one most closely aligned with the functionality expected of existing wearable smart devices, so I considered it the product to which my available data would be most directly applicable.
For this analysis I wanted a tool or set of tools with the following features:
With these requirements in mind, I considered three tools: R, spreadsheets, and databases:
| Feature | R | Spreadsheets | Databases |
|---|---|---|---|
| Power for large data sets | Yes | No | Yes |
| Data manipulation tools | Yes | Yes | Yes |
| Data analysis tools | Yes | Yes | No |
| Data visualisation tools | Yes | Yes | No |
| Separate analysis files | Yes | No | Yes |
| Streamlined report generation | Yes | No | No |
| Straightforward to learn | No | Yes | No |
Clearly, R is the best single tool for this analysis. R lacks the straightforward operation of spreadsheet tools, and I will need to learn libraries and programming techniques as I go, but this is acceptable given my prior experience with other programming languages like Python.
Note: Python itself was not considered for this analysis: while it shares most of the features, advantages, and disadvantages of R, I’m already familiar with Python and wanted to use this case study to familiarise myself with R instead.
I set up my RStudio environment using the following packages:
For this data analysis, I’ll be making use of one public data set specified by Sršen: the FitBit Fitness Tracker Data Set, a public data set available under the CC0 License via Kaggle user Mobius.
For the initial analysis, the data set was downloaded in its entirety from Kaggle and stored locally on my computer. This provided a baseline for analysis, with any modifications to file names, folder re-structuring, or removal of unnecessary data tables to be conducted after I’m familiar with the raw data. The CSV files in the data set were loaded directly into the environment as data frames.
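The loading step can be sketched as below: a minimal base-R version that reads every CSV in a folder into a named list of data frames (the folder path in the usage comment is illustrative, not the actual download location):

```r
# Load every CSV in a folder into a named list of data frames,
# one entry per file, keyed by the file name without its extension.
load_csvs <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  frames <- lapply(files, read.csv, stringsAsFactors = FALSE)
  names(frames) <- tools::file_path_sans_ext(basename(files))
  frames
}

# e.g. fitbit_data <- load_csvs("data/fitabase_export")  # placeholder path
```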
The database contains data on the following features of the FitBit devices:
| Feature | Units | Sampling Rate |
|---|---|---|
| BMI | BMI | Manual/automatic logging |
| Body Weight | kg | Manual/automatic logging |
| Body Fat | % | Manual/automatic logging |
| Calorie burn | Calories | 1 minute |
| Distance | Unknown | 1 minute |
| Heart Rate | Beats per Minute (BPM) | 5 seconds |
| Intensity (of exercise) | Factor, 0 to 3 | 1 minute |
| METs (during exercise) | METs | 1 minute |
| Sleep | Factor, 1 to 3 | 1 minute |
| Steps | Steps | 1 minute |
The file names in the data set present the following issues:
To correct each of these issues, I’ll rename the files using this naming convention:
[feature]_[src/sum]_[interval]_[shape].[filetype]
For example, “sleepDay_merged.csv” will be renamed to “sleep_sum_days_wide.csv”. This format removes the ambiguities in the original file names, and ensures all required information on the contents of the data is included.
All variables in the data set will be changed from CapitalisedCase to snake_case, by convention and to address an issue with inconsistently-capitalised variables (e.g. logId/logid).
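A minimal sketch of the case conversion; janitor::clean_names() does this in one call on a whole data frame, but a base-R stand-in shows the idea (janitor handles edge cases like runs of capitals differently):

```r
# Convert CapitalisedCase variable names to snake_case by inserting an
# underscore before each capital that follows a lowercase letter or digit.
to_snake_case <- function(x) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

to_snake_case(c("TotalSteps", "SedentaryMinutes", "LogId"))
# -> "total_steps" "sedentary_minutes" "log_id"
```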
Some of the tables in the data set use variables of a type unsuitable for analysis. These variables and their required modifications are tabulated below:
| Variable | Original Type | Required Type | Issue |
|---|---|---|---|
| ActivityDay | chr | datetime | Cannot perform datetime operations on chr variables |
| ActivityHour | chr | datetime | Cannot perform datetime operations on chr variables |
| ActivityMinute | chr | datetime | Cannot perform datetime operations on chr variables |
| Date | chr | datetime | Cannot perform datetime operations on chr variables |
| Id | num | chr | Disable scientific notation and numerical operations (IDs are not a numeric value) |
| LogId | num | chr | Disable scientific notation and numerical operations (IDs are not a numeric value) |
| SleepDay | chr | datetime | Cannot perform datetime operations on chr variables |
| Time | chr | datetime | Cannot perform datetime operations on chr variables |
The “bodycomp_src_logs_wide.csv” file contains both kilogram and pound weight variables. These describe the same information, and every observation contains values for both, so one variable can be dropped with no loss of data. The choice between the two formats is arbitrary for my analysis, so I kept the kilogram data, as it’s expressed in an SI unit.
“intensity_sum_hours_wide.csv” contains both Total Intensity and Average Intensity variables. Total Intensity values exceed 3, the maximum of the Intensity factor, so they are per-hour sums rather than direct intensity readings and are not actually useful; Average Intensity is just each hour’s Total Intensity divided by the 60 minutes in the hour.
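The two column drops can be sketched as below (the data frames and values here are illustrative, with names following the cleaned-variable convention):

```r
# Drop the redundant pound-weight column: it can always be reconstructed
# from the kilogram column (kg * 2.20462), so no information is lost.
bodycomp <- data.frame(weight_kg = c(72.4, 61.0), weight_pounds = c(159.6, 134.5))
bodycomp$weight_pounds <- NULL

# Likewise drop average_intensity, which is just intensity_total / 60.
intensity <- data.frame(intensity_total = c(30, 12), average_intensity = c(0.5, 0.2))
intensity$average_intensity <- NULL
```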
All of the numeric variables in the data set have clearly defined units except for “Distance”. Distance does not appear to have a source data table: it’s only included in the “activity_sum_days_wide” and “intensity_sum_days_wide” tables, and only as summary data grouped by Intensity level. Daily values between 0 and 1 are present, so it seems reasonable to assume the unit is kilometers or miles, as opposed to meters or feet. Miles and kilometers represent the same information on slightly different scales, so the analysis should not be affected as long as the units are consistent across tables, which they appear to be. I will therefore assume the distances are given in kilometers for this analysis.
Two variables in the data set, “Intensity” (for exercise) and “Value” (for sleep), are numerical factors with no defined range. For these factors, I couldn’t tell from the data alone whether all possible values that a FitBit can record are present. The values for exercise intensity, for example, range from 0 to 3; it could have been the case that the FitBits used only generate four levels of intensity, but it could just as easily have been that the values go up to 100 (i.e. a percentage). This had clear implications for the analysis: if 3 was the max, records of 3 indicated users wear their FitBits while exercising as hard as they can, whereas if 100 was the max, records of 3 indicated users wear their FitBits while sitting on the couch as hard as they can. The meaning of these variables had to be validated before the analysis could continue.
As discussed above, the exercise-related “Intensity” variable is a numerical factor with no defined range: the meaning and limits of this variable had to be determined before the analysis could proceed. I started by determining the range of values present in the source Intensity data:
## [1] 0 1 2 3
The “activity_sum_days_wide” table provided some context clues as to what these four levels might mean: the table summarised Intensity data into four new variables which, based on their names, appeared to be associated with intensity levels as follows:
I confirmed this naming convention by using it to reconstruct the daily summary data in “activity_sum_days_wide”, then comparing the two tables and investigating any rows with non-zero differences:
## Printing all rows with non-zero diff values:
## Rows: 465
## Columns: 5
## $ Id_ActivityDate_UID <chr> "1503960366_2016-04-12", "1503960366_2016-…
## $ diff_minutes_sedentary <dbl> 346, 407, 442, 400, 679, 320, 377, 364, 38…
## $ diff_minutes_lightly_active <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ diff_minutes_fairly_active <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ diff_minutes_very_active <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
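The comparison above boils down to joining the original summary to the reconstruction on the composite user/date key, computing per-row differences, and keeping any rows where a difference is non-zero. A minimal sketch with made-up values:

```r
# Join original and reconstructed summaries on the composite key, then
# compute per-row differences; non-zero deltas flag mapping errors.
original <- data.frame(uid = c("a_2016-04-12", "a_2016-04-13"),
                       minutes_very_active = c(30, 0))
rebuilt  <- data.frame(uid = c("a_2016-04-12", "a_2016-04-13"),
                       minutes_very_active = c(30, 5))

both <- merge(original, rebuilt, by = "uid", suffixes = c("_orig", "_new"))
both$diff <- both$minutes_very_active_new - both$minutes_very_active_orig
both[both$diff != 0, ]  # one illustrative mismatched row
```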
The reconstruction matched the original for every category except “sedentary minutes”, which was consistently higher in the reconstructed data.
I concluded that I got the mapping correct, given that:
That said, I still didn’t know whether these were the categories FitBits actually work in or just a naming convention the data authors came up with, and I didn’t know if all FitBit models worked this way. To address this, I referred to the manuals of the relevant FitBits.
Activity trackers like FitBits detect activity intensity partly by measuring the user’s heart rate while exercising: a higher heart rate corresponds with a higher degree of exertion. As of April 2016, the three latest FitBit models with heart-rate tracking were:
The product manuals for each model confirmed they all break down user activity into four default heart-rate zones:
| Product | HR Zone 1 | HR Zone 2 | HR Zone 3 | HR Zone 4 |
|---|---|---|---|---|
| FitBit Blaze | “Out of Zone” | “Fat burn” | “Cardio” | “Peak” |
| FitBit Charge HR | “Out of Zone” | “Fat burn” | “Cardio” | “Peak” |
| FitBit Surge | “Out of Zone” | “Fat burn” | “Cardio” | “Peak” |
While those weren’t exactly the terms used in the data set, they were clearly related: “Out of Zone” equated to “Sedentary”, for example.
All three FitBit manuals also made the same claim that the default zones were “based on American Heart Association recommendations”. Even without validating that claim, it indicated to me that the reasoning behind each zone was not arbitrary and was consistent across devices, so I concluded that any other FitBit models circa 2016 would use the same classification scheme:
As discussed above, the sleep-related “Value” variable is a numerical factor with no defined range: the meaning and limits of this variable had to be determined before the analysis could proceed. As with exercise intensity, I started by determining the range of values present in the data:
## [1] 3 2 1
The values for sleep quality range from 1 to 3. The summary data tables for sleep quality introduced only two new variable names: “Total Time In Bed”, and “Total Minutes Asleep”. There wasn’t a valid name for each level of factor like there was for exercise, so in this case I referred straight back to the manuals:
Further research in the FitBit help pages on how to track sleep stats and what they all mean revealed that different devices track slightly different data if they have heart rate tracking:
The help pages also singled out the Charge HR and the Surge as the only HR-tracking FitBits without full sleep stage tracking, leaving the Blaze as the only device from this period with that feature. Blaze aside, motion-based sleep quality tracking appeared to go all the way back to the FitBit One. Given this information, I concluded that any other FitBit models circa 2016 would use the following classification scheme for the sleep data:
My tests revealed that I had this backwards, and the actual scheme was:
As with the intensity data, I confirmed this naming convention by using it to reconstruct the daily summary data in “activity_sum_days_wide”, then comparing the two tables and investigating any rows with non-zero differences:
## Printing all rows with non-zero diff values:
## Rows: 16
## Columns: 4
## $ Id_SleepDay_UID <chr> "4388161847_2016-04-27", "4388161847_2016-04-30", "438…
## $ recordDiff <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ sleepDiff <dbl> 22, 1, 6, 5, 1, 1, 2, 2, 520, 520, 16, 2, 10, 8, 20, 3
## $ bedDiff <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 543, 543, 0, 0, 0, 0, 0, 0
The two 500+ minute values were investigated and found to be the result of double-sampling: that day had two duplicate logs, which the original summary data presented as two duplicate rows with the accurate ~500-minute value, while my reconstruction presented as two rows with the full ~1,000 minutes summed. All other rows were identical to or within 22 minutes of the original, which I considered accurate enough to validate the mapping.
The following broad cleaning steps were taken:
These steps are further detailed in the sections below.
For this analysis, I renamed the files according to the following convention:
[feature]_[src/sum]_[interval]_[shape].[filetype]
Applying the naming convention to the data set yielded the following file names:
| Original | Updated |
|---|---|
| dailyActivity_merged.csv | activity_sum_days_wide.csv |
| dailyCalories_merged.csv | calories_sum_days_tall.csv |
| dailyIntensities_merged.csv | intensity_sum_days_wide.csv |
| dailySteps_merged.csv | steps_sum_days_tall.csv |
| heartrate_seconds_merged.csv | heartrate_src_seconds_tall.csv |
| hourlyCalories_merged.csv | calories_sum_hours_tall.csv |
| hourlyIntensities_merged.csv | intensity_sum_hours_wide.csv |
| hourlySteps_merged.csv | steps_sum_hours_tall.csv |
| minuteCaloriesNarrow_merged.csv | calories_src_mins_tall.csv |
| minuteCaloriesWide_merged.csv | calories_sum_mins_wide.csv |
| minuteIntensitiesNarrow_merged.csv | intensity_src_mins_tall.csv |
| minuteIntensitiesWide_merged.csv | intensity_sum_mins_wide.csv |
| minuteMETsNarrow_merged.csv | mets_src_mins_tall.csv |
| minuteSleep_merged.csv | sleep_src_mins_tall.csv |
| minuteStepsNarrow_merged.csv | steps_src_mins_tall.csv |
| minuteStepsWide_merged.csv | steps_sum_mins_wide.csv |
| sleepDay_merged.csv | sleep_sum_days_wide.csv |
| weightLogInfo_merged.csv | bodycomp_src_logs_wide.csv |
The conversion was performed manually on my local device.
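Although I performed the rename by hand, the mapping could equally be scripted. A sketch, with only one illustrative entry in the map (the full map follows the table above, and the folder path is a placeholder):

```r
# Rename files in a folder according to an old-name -> new-name map.
rename_files <- function(dir, map) {
  file.rename(file.path(dir, names(map)), file.path(dir, unname(map)))
}

rename_map <- c("sleepDay_merged.csv" = "sleep_sum_days_wide.csv")
# rename_files("data", rename_map)  # "data" is a placeholder path
```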
Each data frame was checked for NULL values and non-null empty strings.
The only table with missing data was “bodycomp_src_logs_wide”, which was missing all but two of the data points for Body Fat Percentage. Given this lack of data, Body Fat Percentage was excluded from the data analysis.
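The check can be sketched as a per-column count of NA values plus non-NA empty strings (the data frame here is illustrative):

```r
# Count missing values per column: NA entries plus non-NA empty strings.
count_missing <- function(df) {
  sapply(df, function(col) sum(is.na(col)) + sum(!is.na(col) & col == ""))
}

count_missing(data.frame(fat = c(25, NA, NA), note = c("", "x", "y")))
```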
Variables were cleaned and renamed where required according to the following specification:
| Original Name | Updated Name |
|---|---|
| activity_date | activity_day |
| date (bodycomp_src_logs_wide.csv) | bodycomp_datetime |
| date (sleep_src_mins_tall.csv) | sleep_minute |
| log_id (sleep_src_mins_tall.csv) | sleep_log_id |
| log_id (bodycomp_src_logs_wide.csv) | bodycomp_log_id |
| fairly_active_distance | distance_fairly_active |
| fairly_active_minutes | minutes_fairly_active |
| light_active_distance | distance_lightly_active |
| light_active_minutes | minutes_lightly_active |
| lightly_active_distance | distance_lightly_active |
| lightly_active_minutes | minutes_lightly_active |
| logged_activities_distance | distance_logged_activities |
| me_ts | mets |
| moderately_active_distance | distance_moderately_active |
| sedentary_active_distance | distance_sedentary |
| sedentary_distance | distance_sedentary |
| sedentary_active_minutes | minutes_sedentary |
| sedentary_minutes | minutes_sedentary |
| step_total | steps_total |
| time (heartrate_src_seconds_tall.csv) | heart_rate_second |
| total_distance | distance_total |
| total_intensity | intensity_total |
| total_minutes_asleep | minutes_asleep_total |
| total_sleep_records | sleep_records_total |
| total_steps | steps_total |
| total_time_in_bed | minutes_in_bed_total |
| tracker_distance | distance_tracker |
| value (heartrate_src_seconds_tall.csv) | heart_rate |
| value (sleep_src_mins_tall.csv) | sleep_rank |
| very_active_distance | distance_very_active |
| very_active_minutes | minutes_very_active |
Duplicate rows were removed from all data frames using the anyDuplicated() and distinct() functions.
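A minimal sketch of the dedup step, using base unique() in place of the tidyverse dplyr::distinct() (illustrative data):

```r
# anyDuplicated() returns 0 for a clean frame, otherwise the index of the
# first duplicate row; unique() then drops the duplicate rows.
df <- data.frame(id = c("1", "1", "2"), steps = c(100, 100, 50))
if (anyDuplicated(df) > 0) {
  df <- unique(df)
}
```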
Variable types were recast according to the following specification:
| Variable | Original Type | Updated Type | Reason |
|---|---|---|---|
| activity_day | chr | datetime | Cannot perform datetime operations on chr variables |
| activity_hour | chr | datetime | Cannot perform datetime operations on chr variables |
| activity_minute | chr | datetime | Cannot perform datetime operations on chr variables |
| date | chr | datetime | Cannot perform datetime operations on chr variables |
| id | num | chr | Disable numerical operations (IDs are considered a UID string) |
| log_id | num | chr | Disable numerical operations (IDs are considered a UID string) |
| time | chr | datetime | Cannot perform datetime operations on chr variables |
| sleep_rank | num | factor 1:3 | Disable numerical operations (value is a ranking, not an amount) |
| intensity | num | factor 0:3 | Disable numerical operations (value is a ranking, not an amount) |
Given the large number of variables in the data set, the recasting procedure includes test code to confirm the updated variable types match the types specified in the variable mods list. The test code works by first generating a list of the desired final variable/type pairs, then, once the conversion is completed, generating a second list of the actual variable/type pairs in the data frames to compare it to. This has the advantage of not just confirming the desired conversions took place, but also checks for any unintended changes to variables that did not require conversion.
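A minimal sketch of the audit, using an illustrative subset of columns: record the expected class per column, recast, then compare expected against actual classes:

```r
# Expected final class for each column (illustrative subset of variables).
expected <- c(id = "character", activity_day = "Date", steps_total = "numeric")

df <- data.frame(id = 1503960366, activity_day = "4/12/2016", steps_total = 13162)
df$id <- as.character(df$id)
df$activity_day <- as.Date(df$activity_day, format = "%m/%d/%Y")

# Compare actual classes to the expected list: any FALSE flags a missed
# conversion or an unintended change to a column that needed none.
actual <- sapply(df, function(col) class(col)[1])
all(actual[names(expected)] == expected)  # TRUE when every type matches
```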
Building the test code added a decent amount of work to the project, but now that I have it working, it can be reused and scaled for future projects.
Leading and trailing whitespace was removed from all variable names and all character-type data fields.
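A minimal sketch of the trim, applied to both the variable names and every character column (illustrative data):

```r
# Strip leading/trailing whitespace from column names and character fields.
trim_frame <- function(df) {
  names(df) <- trimws(names(df))
  df[] <- lapply(df, function(col) if (is.character(col)) trimws(col) else col)
  df
}

trim_frame(data.frame(" id " = " 123 ", check.names = FALSE,
                      stringsAsFactors = FALSE))
```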
Given that the data was not entered manually by the user, there was no way for me to manually check the correctness of all of the numeric values included in the data set. The values were instead checked against pre-determined limits to verify that they fell within realistic ranges, as detailed below.
Validating the data in this way also helps confirm the data makes sense in terms of the business logic, by confirming the data falls within realistic ranges given the capabilities of the devices and the types of data they claim to track.
All ID values in the data set were checked against their expected length according to the following specification:
| ID Variable | Correct Length (Digits) |
|---|---|
| id | 10 |
| sleep_log_id | 11 |
| bodycomp_log_id | 13 |
The validation confirmed all values were the correct length.
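The length check can be sketched as follows, using the digit widths from the table above:

```r
# An ID vector is valid when every value is purely numeric digits of the
# expected width.
ids_valid <- function(ids, width) {
  all(nchar(ids) == width & grepl("^[0-9]+$", ids))
}

ids_valid(c("1503960366", "1624580081"), 10)  # TRUE: both are 10 digits
```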
To further validate the large number of ID data points in the data set, I checked the minimum and maximum ID values in each data frame. The intent was to identify any possible erroneous values within the acceptable length limits, e.g. all zeroes or all nines.
The validation confirmed all ID values were within a realistic range.
The data set was described as containing data collected from users between “03.12.2016-05.12.2016”: all records in the data set were validated against this date range. An initial check found dates just outside the range, extending to 8am on 05.13.2016, the day after the data set supposedly ended. I didn’t consider this a problem, so the valid end date was updated to the 14th of May 2016 accordingly.
The validation confirmed all date values were within the target range.
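A sketch of the range check using the updated valid window (the dates being checked are illustrative):

```r
# Flag any dates outside the accepted collection window.
valid_start <- as.Date("2016-03-12")
valid_end   <- as.Date("2016-05-14")

dates <- as.Date(c("2016-04-12", "2016-05-13", "2016-06-01"))
out_of_range <- dates[dates < valid_start | dates > valid_end]  # one stray date
```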
The validation code was split into generic checks that could be run on all numeric values, followed by code to analyse specific variables.
Sources for ranges:
There were a range of features to consider when comparing the FitBit devices in the data set to the Bellabeat Ivy. Of these, I chose to focus this analysis on the features that were both available on the Ivy and present in the data set, either directly or via suitable proxy data. This approach allowed me to focus my recommendations on features the marketing team can actually market to the public in the first place.
The existing marketing material for the Ivy states that Heart-rate Tracking can be “used to track your workout progress and optimize personal training routines,” and that the Exercise Tracking feature “will recognize your activity during the day, help you track up to 80 types of activity, count your steps, and discover how all that affects your body.” These features are closely related, since one of the primary functions of the HR tracking is to detect exercise intensity; steps and distance tracking can also be used to further analyse users’ activity.
In order to better understand how people are using similar functions on their FitBits, I analysed the quantities and qualities of the exercise that users in the data set were logging.
I started by getting a high-level breakdown of the time users were spending at different intensity levels:
Given that the large majority of non-Sedentary time is “Lightly Active”, and based on my understanding of the Intensity zones from previous sections, I assumed “Lightly Active” time included any activity more intense than sitting down, up to and including activities like walking for leisure. Anything more intense therefore fell into the “Fairly Active” or “Very Active” categories. With this in mind, I proceeded with the following definition:
“Active Time” is time logged in either the “Fairly Active” or “Very Active” intensity category, and is assumed to be time spent intentionally exercising.
Using this definition, I analysed users’ time spent intentionally exercising, i.e. their Active Time:
Here I analysed the data to see if there was any variation in the level of intensity of exercise between users. To do this I calculated each user’s Very Active Time as a proportion of their total Active Time: this gave me a single measure of how intensely each individual user was typically exercising:
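The per-user measure can be sketched as below (the values are illustrative):

```r
# Very Active Proportion: each user's Very Active time as a share of their
# total Active Time (Fairly Active + Very Active).
per_user <- data.frame(
  id                  = c("u1", "u2"),
  hours_fairly_active = c(4, 1),
  hours_very_active   = c(6, 1)
)
per_user$very_active_prop <- per_user$hours_very_active /
  (per_user$hours_fairly_active + per_user$hours_very_active)
```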
The previous analysis showed that users were engaging in a range of intensity levels. I also wanted to identify if there was a trend in the cohort where users who exercised more often also exercised more intensely. This would help me characterise the user group: perhaps users who exercised harder tended to exercise longer; perhaps those who exercised longer were favouring lower-intensity activities, like walking; or perhaps there would be no pattern, indicating a variety of exercise preferences. To analyse this, I plotted the Very Active Proportion data from the previous analysis against total Active Time for each user:
Very Active Proportion was positively correlated with Active Time, with an R^2 value of 0.27. On visual inspection of the plot, the cohort appeared split into a majority group clustered around the lower end of Active Hours, and a smaller group with higher hours. Intuitively, this seemed to be skewing the scatter plot into the upper-right corner, and possibly contributing to a stronger correlation between hours and intensity. To put some numbers behind this, I re-drew the plot with the upper quintile of users removed:
With the upper quintile of the cohort removed, the correlation between hours and intensity dropped from 0.27 to 0.097. This supported my hypothesis that a small group of users was particularly active, both in terms of quantity and intensity of exercise, while the remaining majority was spread out over a wider range.
One potential selling point for any health-oriented wearable device like FitBits is the idea that they actively encourage users to be more active. I wanted to see if this idea was borne out in the data: I knew that most users did not exercise much, on average, but did those values increase at all? If they did, there would be grounds to claim that device usage is at least associated with an increase in users’ physical activity.
To analyse this, I calculated the correlation coefficients between time and exercise hours, at each intensity level and for each user, which condensed the full months’ worth of data for each user into a single variable indicating their overall trend. These coefficients were then plotted as a histogram, allowing me to visualise the trends for all users at once:
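A minimal sketch of the condensing step, with two illustrative users whose activity trends in opposite directions:

```r
# One correlation coefficient per user: day index vs daily active hours.
# Positive values indicate increasing activity over time, negative decreasing.
daily <- data.frame(
  id    = rep(c("u1", "u2"), each = 5),
  day   = rep(1:5, times = 2),
  hours = c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1)
)
trends <- tapply(seq_len(nrow(daily)), daily$id, function(rows) {
  cor(daily$day[rows], daily$hours[rows])
})
# hist(trends) would then visualise the cohort's distribution of trends
```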
The distributions of correlation coefficients were skewed slightly negative, peaking around -0.2 to -0.1. There was no indication that most users were increasing their exercise duration over time: in fact, the opposite seemed to be true.
As a final check, I re-ran the analysis, this time analysing the correlation between time and Very Active Proportion:
As with the overall categories, the correlation coefficients were mostly negative, with a mean and median of -0.04 and -0.08, respectively.
As with overall exercise intensity, I analysed user preferences towards walking or running to help target my recommendations towards the activities users are already engaged in. For this, I made use of the step count source data. This data is logged every minute, so each data point equates to the user’s average speed, expressed in steps-per-minute.
For this analysis I defined several walking speed ranges based on steps per minute, intended to be analogous to the four activity Intensity levels:
| Speed Range | Steps per Minute |
|---|---|
| Not Walking | 0 |
| Moderate Walking | 0 to 80 |
| Brisk Walking | 80 to 110 |
| Running | > 110 |
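The binning can be sketched with cut(), treating each boundary as right-inclusive (so exactly 80 steps/min counts as Moderate and exactly 110 as Brisk; the boundary handling is my assumption):

```r
# Bin per-minute step counts into the walking-speed categories defined above.
speed_class <- function(steps_per_min) {
  cut(steps_per_min,
      breaks = c(-Inf, 0, 80, 110, Inf),
      labels = c("Not Walking", "Moderate Walking", "Brisk Walking", "Running"))
}

speed_class(c(0, 45, 95, 130))
# -> Not Walking, Moderate Walking, Brisk Walking, Running
```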
I also defined a value, “Active Walking”, to be used as an analog to the Active Time value used in the activity intensity analysis:
“Active Walking Time” is time logged in either the “Brisk Walking” or “Running” speed categories, and is assumed to be time spent intentionally walking, jogging, or running for exercise.
I then calculated the daily average times that users spent in each range:
As an alternative visualisation, I also plotted a histogram comparing just the non-immobile categories:
This plot breaks out the visualisation of the less-common walking speeds for easier comparison: it can be seen clearly that the majority of users are getting half an hour or less of Active Walking per day, but more than two hours of Moderate Walking per day.
Next, I analysed the quantity of Active Walking specifically that users were logging:
Finally, I analysed what proportion of Active Walking time that users were spending Running:
To analyse changes in user movement habits over time, I repeated the analysis of time correlation coefficients I performed on the activity intensity data, but for the walking speed categories I had defined:
Step counts are commonly used as a measure of overall activity levels. The typical figure given for a healthy adult is 10,000 steps per day: an example can be found in this Australian Health Survey, which states that “For adults, 10,000 steps is used by researchers worldwide as a reasonable estimate of daily activity by healthy adults.” For this analysis, I first analysed overall daily step counts in the cohort:
Examples of stationary exercise include gym-based activities like treadmill running and weight-lifting. Examples of mobile exercise include outdoor activities like jogging, but also indoor activities like squash or, depending on the accuracy of the FitBit distance tracking, gym workouts based on rotation through various equipment and exercises spaced around the venue. Determining if users have a preference for one activity type over another can help guide the marketing for the Ivy.
I analysed stationary versus mobile activity preferences by plotting the data for Active Minutes against Active Distance. The theory was that the more stationary exercise users do (e.g. machine-based cardio and resistance training), the weaker the correlation between active time and active distance should be, because users are not covering distance while exercising.
The “Readiness Score” feature of the Bellabeat Ivy makes use of several other features, including the Resting Heart Rate, Respiratory Rate, and Cardiac Coherence functions. The existing marketing material states that “While you sleep at night and your body is in its calmest state, Ivy measures your resting heart rate, respiratory rate, and cardiac coherence… In the morning, when synced with the Bellabeat app, Ivy calculates your readiness score – how ready you are for the day on a scale from 0 to 100.”
This score is one of several features of the Ivy that was not present on the FitBits in the available data, which made it a potential value-add feature, but also made it difficult to investigate existing user behaviours for. One important aspect that I could analyse was whether or not users currently wear their devices to bed.
Nighttime usage of FitBits was particularly relevant to the Readiness Score, as that function does not work unless the user wears their device to bed. If people were already frequently wearing their devices to bed, then the marketing could leverage this tendency to push the feature as an improvement on existing features. On the other hand, if users were not wearing their devices overnight, this might indicate that users were not interested in existing overnight-monitoring functions, or that there were other factors that made overnight wear unappealing, which marketing teams would need to take into account.
As of April 2016, the following wristband FitBits were available:
| Model | Sleep Logging | Heart-rate Tracking | Source |
|---|---|---|---|
| Alta | Yes | No | User Manual |
| Blaze | Yes | Yes | User Manual |
| Charge | Yes | No | User Manual |
| Charge HR | Yes | Yes | User Manual |
| Flex 1 | Yes | No | User Manual |
| Flex 2 | Yes | No | User Manual |
| Surge | Yes | Yes | User Manual |
Sleep logging is available on all devices, so sleep log data should indicate how the overall cohort uses this feature. Heart-rate logging is limited to later devices, so heartrate data will only indicate how that subset of the cohort uses that feature. With this in mind, I focused my analysis on the sleep logs.
For each user, I calculated the number of nights where the total sleep time logged was greater than 6 hours. This threshold was picked as being close enough to the recommended 8 hours of sleep to indicate the user actually went to bed wearing their device, but low enough to still capture users who don’t get a full night’s sleep.
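A sketch of the per-user count, using the 6-hour (360-minute) threshold described above and illustrative data:

```r
# Count nights per user where more than 360 minutes of sleep was logged,
# taken as evidence the device was actually worn to bed.
sleep <- data.frame(
  id                   = c("u1", "u1", "u2"),
  minutes_asleep_total = c(420, 300, 400)
)
nights_worn <- tapply(sleep$minutes_asleep_total > 360, sleep$id, sum)
```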
The sleep logs data gave a good indication of whether users were interested in sleep logging data, but it didn’t directly confirm whether people were wearing their devices to bed. To investigate this, I also analysed the heart-rate data for each user. This was by far the largest and least-processed data set, but it had the advantage of only generating data if the device was physically contacting the user at that time.
The heart rate data took the form of individual polls recording the user’s momentary heart rate. This presented a challenge compared to using the summary data, in that these individual logs had to be processed to determine when the user was wearing their device continuously. Simply counting the number of logs over a given time period would not work, as the time between logs varies even when the device is worn. Polling intervals under 60 seconds, for example, typically fall between 1 and 20 seconds, as shown below:
This meant that even choosing the median value of 5 seconds could result in an inferred “time logged” value that’s out by a factor of 4.
To build out the data, I decided on an approach where the time between logs was used to infer the time logged, but only within an acceptable time difference. If one log follows its predecessor by less than this difference, the time between the two is considered “Logged” time: if the delay is greater, we assume the device was taken off for that time. I selected 20 seconds as the maximum acceptable time between polls, based on the “Maximum” value from the boxplot of 17 seconds.
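The inference can be sketched as below: sort each user's poll timestamps, take the gaps between consecutive polls, and count only gaps at or under the 20-second cap as worn time:

```r
# Infer continuously-worn seconds from heart-rate poll timestamps: gaps
# longer than the cap are treated as the device being off the wrist.
logged_seconds <- function(timestamps, cap = 20) {
  gaps <- diff(sort(timestamps))
  sum(gaps[gaps <= cap])
}

# Polls at 0, 5 and 12 s, a ten-minute gap, then polls at 612 and 617 s:
logged_seconds(c(0, 5, 12, 612, 617))  # 17 seconds of inferred wear
```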
Given the small number of heart-rate users, I also compared the heart-rate results directly to the sleep log results to see if there was any difference in the nights logged for individual IDs:
For those IDs present in both data sets, the results were almost identical, with only three users showing variation equating to 2-3 nights, indicating the method is valid.
From the previous analysis for the Readiness Score feature, we know that sleep tracking is not commonly used, with a majority of 51% (17 users) logging 10% of nights or fewer, and the remainder spread out over a range up to 90% of nights.
Sleep logs in the data set rank sleep quality as Awake, Restless, or Asleep. To analyse the quantity of sleep in the cohort, I calculated the total time spent Asleep for each user:
To analyse the quality of sleep in the cohort, I calculated the time spent Asleep as a percentage of total Sleep Time logged for each user:
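As a minimal sketch of this quality measure, on synthetic minute-level ranks (assuming rank 1 = Asleep, as in the source data; the user IDs are illustrative):

```r
# Percentage of logged sleep minutes spent Asleep, per user
library(dplyr)

sleep_mins <- tibble(
  id    = c(rep("user_1", 4), rep("user_2", 4)),
  value = c(1, 1, 1, 2,   1, 1, 3, 3)  # 1 = Asleep, 2 = Restless, 3 = Awake
)

sleep_quality <- sleep_mins %>%
  group_by(id) %>%
  summarize(pct_asleep = 100 * sum(value == 1) / n())
# user_1: 75% of logged minutes Asleep; user_2: 50%
```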
I then applied my previous time-correlation analysis to these two measures, calculating each user’s correlation coefficient with time for “total hours asleep” (quantity) and “percentage of time asleep” (quality):
Although the FitBits in the data set did have heart rate tracking, and some models from the period had a “Breathing Exercise” feature, there was no data for resting heart rate or cardiac coherence specifically. What I could analyse was the data on heart-rate tracking generally, asking questions including:
Answering these questions would help me understand whether the RHR and CC functions would appeal to the user base: for instance, users who rarely wear their existing heart-rate monitors, or who only wear them during intense exercise, might be less likely to care about their longer-term cardiac performance.
From my previous analysis on heart-rate data, I knew that overnight heart-rate tracking was split, with 50% (7 users) logging overnight heart rate on 35% to 90% of nights, and the other 50% logging it on 0% to 15% of nights. I adapted this analysis of nighttime heart-rate logging to investigate daytime usage:
First, the naive approach: what percentage of each user’s time is spent in each heart-rate zone?
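This zone breakdown can be sketched on synthetic polls as follows (the zone boundaries and labels below are illustrative placeholders, not FitBit’s definitions):

```r
# Bucket each heart-rate sample into a zone, then compute each zone's
# share of a user's samples
library(dplyr)

hr <- tibble(
  id         = "user_1",
  heart_rate = c(62, 70, 88, 95, 120, 143, 156, 171)
)

hr_zones <- hr %>%
  mutate(zone = cut(heart_rate,
                    breaks = c(0, 100, 140, 170, Inf),
                    labels = c("Rest", "Light", "Cardio", "Peak"))) %>%
  count(id, zone) %>%          # samples per user per zone
  group_by(id) %>%
  mutate(pct_of_samples = 100 * n / sum(n))
```

The naive part is that this weights every poll equally, so a user whose device polls more often during exercise will appear to spend more time in the higher zones than they really do.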
Several functions and features of the Ivy are not present on the FitBits in the data set. Given the lack of data, it’s not possible to directly analyse existing user behaviours with respect to these features, which include:
Based on the findings in the previous section, I present the following recommendations for the marketing strategy for the Bellabeat Ivy:
Promote overnight device wear as a value-add practice, but highlight device comfort equally
The analysis indicated that the majority of users do not wear their devices overnight. This is a potential problem for marketing, as a number of key features of the device, including the Readiness Score, Resting Heart Rate tracking, Cardiac Coherence, and Sleep Tracking, will function in a reduced capacity, or not at all, if the device cannot gather data from the user overnight. The Help documentation for the Ivy states explicitly that, for the Readiness Score feature, “To fully calibrate your baseline and get more reliable readiness score results you must wear your Ivy for 30 nights.” Marketing materials should therefore present these features as benefits the user gains from overnight wear, to encourage the practice. This in turn allows these functions to be positioned as value-add features that do more with the data gathered overnight than competing products do.
One obvious reason why users may not be wearing their devices overnight is comfort. One of the FitBit manuals referenced in my research makes explicit mention of this, informing the user that “wearing your tracker 24/7 does not allow your skin to breathe”. The design of the Ivy may not be helping matters here: it has a more angular, protruding body than typical “watch”-type wearables, which may negatively impact overnight comfort.
![Bellabeat Ivy on woman’s wrist](https://bellabeat.com/wp-content/uploads/2022/05/RHR-1.jpg)
To mitigate these potential consumer concerns, the marketing should highlight the hypoallergenic and hygienic aspects of the materials used in the Ivy, positioning these as keeping users comfortable during overnight wear.
Showcase users in a broad range of activities, rather than focusing on athleticism
To appeal to the FitBit market, marketing materials for the Ivy’s exercise tracking features should highlight a broad range of activities ranging from gentle to more intense. The marketing should not target highly-athletic groups, as these represent only a small part of the user base.
Focus on mobile activities over stationary activities
Users were found to prefer mobile activities, with very little evidence of high-intensity, low-distance exercise in the data set. Marketing materials should therefore focus on mobile, outdoor activities like light jogging, walking dogs, and cycling, rather than stationary activities like gym workouts.
Focus on walking over running
The data showed that running was not particularly popular in the user group, so marketing materials can highlight walking over running, perhaps using the additional focus to highlight a range of people and locations.
Avoid implying that device usage will make users exercise more or sleep better
The marketing for the Ivy should avoid any claims that users will increase their activity levels, their step counts, or their sleep quantity or quality as a result of purchasing the device, as the analysis does not support such claims. If anything, the analysis showed the opposite effect, with users typically trending downward on these metrics.
The analysis did find that most users were consistently getting a good amount of high-quality sleep, even if this did not improve over time. The Sleep Tracking can therefore be marketed more as a way of tracking and reinforcing the user’s existing good sleep habits than as a way of improving them. If a manually-activated sleep tracking mode exists, this should be highlighted as well, as users may get more-representative sleep logs if they include time spent trying to get to sleep, rather than relying on the device to detect sleep and only then start logging.
Highlight the benefits of daytime Resting Heart Rate data
The analysis found that, while most users do not log heart-rate data overnight, most do log heart-rate data during the day. This indicates that, while the primary benefit of the Resting Heart Rate (RHR) function comes from recording deep sleep overnight, there is still an angle to promote the function on the basis of daytime usage. Daytime RHR can be promoted as a measure of overall health or even a target in its own right: for example, some users may wish to track a lowering daytime RHR as an indicator of improved cardiac performance during a training regime. These angles may not be as easy a sell as overnight RHR, but they have the advantage of aligning better with existing user behaviours.
# Set up knitting options with the knitr package
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
results = 'hide',
warning = FALSE
)
# Load all required packages
print("Loading packages...")
rqd_pkgs <- c(
"dplyr",
"anytime",
"tidyverse",
"janitor",
"readr",
"skimr",
"tibble",
"ggplot2",
"gridExtra",
"ggpubr",
"kableExtra",
"purrr"
)
lapply(rqd_pkgs, function(pkg) {
if(!requireNamespace(pkg, quietly = TRUE)) {
cran_mirror <- "https://cran.r-project.org"
install.packages(as.character(pkg), repos = cran_mirror)
}
library(pkg, character.only = TRUE)
})
rm(rqd_pkgs)
print("Loading packages complete.")
# Set the working directory of the R markdown environment to match that of the console
dir <- "/Users/nathanweaver/Documents/GitHub/Bellabeat Case Study"
setwd(dir)
cat("Set working directory to:", dir, "\n")
# Load all CSVs in the source directory into the environment
csv_dir <- "Fitabase_Data_Cleaned"
paths_dfs <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
df_names <- paths_dfs %>%
basename() %>%
tools::file_path_sans_ext()
for (i in seq_along(df_names)) {
assign(df_names[i], read_csv(paths_dfs[i]))
}
rm(i, paths_dfs)
# Display all unique Exercise Intensity values
print(unique(intensity_src_mins_tall$Intensity))
# Reconstruct the summary intensity data for comparison with the original
activity_sum_days_wide_test <- intensity_src_mins_tall %>%
mutate(activity_date_floored = floor_date(mdy_hms(ActivityMinute), unit = "days")) %>%
group_by(Id, activity_date_floored) %>%
summarize(
minutes_sedentary = sum(Intensity == 0),
minutes_lightly_active = sum(Intensity == 1),
minutes_fairly_active = sum(Intensity == 2),
minutes_very_active = sum(Intensity == 3)
) %>%
mutate(Id_ActivityDate_UID = paste(Id, activity_date_floored, sep = "_")) %>%
arrange(Id, activity_date_floored, Id_ActivityDate_UID)
# Compare both versions of the data and return any dates with different values
intensity_daily_comp <- activity_sum_days_wide %>%
mutate(ActivityDate_floored = floor_date(mdy(ActivityDate), unit = "days")) %>%
mutate(Id_ActivityDate_UID = paste(Id, ActivityDate_floored, sep = "_")) %>%
merge(activity_sum_days_wide_test, by = "Id_ActivityDate_UID", all = TRUE) %>%
mutate(diff_minutes_sedentary = minutes_sedentary - SedentaryMinutes ) %>%
mutate(diff_minutes_lightly_active = minutes_lightly_active - LightlyActiveMinutes ) %>%
mutate(diff_minutes_fairly_active = minutes_fairly_active - FairlyActiveMinutes ) %>%
mutate(diff_minutes_very_active = minutes_very_active - VeryActiveMinutes ) %>%
select(
Id_ActivityDate_UID,
diff_minutes_sedentary,
diff_minutes_lightly_active,
diff_minutes_fairly_active,
diff_minutes_very_active
) %>%
filter(!(diff_minutes_sedentary == 0 &
diff_minutes_lightly_active == 0 &
diff_minutes_fairly_active == 0 &
diff_minutes_very_active == 0)
) %>%
arrange(Id_ActivityDate_UID)
cat("Printing all rows with non-zero diff values:\n")
glimpse(intensity_daily_comp)
rm(activity_sum_days_wide_test, intensity_daily_comp)
# Display all unique Sleep Value values
print(unique(sleep_src_mins_tall$value))
# Reconstruct the summary sleep data for comparison with the original
sleep_sum_days_wide_test <- sleep_src_mins_tall %>%
# mutate(date_typed = mdy_hms(date)) %>%
mutate(date_floored = floor_date(mdy_hms(date), unit = "days")) %>%
# Sum time asleep for each Log ID
group_by(logId) %>%
summarize(
"Id" = min(Id),
# Associate each Log ID with the latest date recorded under it
"SleepDay" = max(date_floored),
"minutes_in_bed" = n(),
"minutes_awake" = sum(value == 3),
"minutes_restless" = sum(value == 2),
"minutes_asleep" = sum(value == 1)) %>%
# Sum time asleep for each date based on SleepDay
group_by(Id, SleepDay) %>%
summarize(
"TotalSleepRecords_2" = n(),
"TotalMinutesAsleep_2" = sum(minutes_asleep),
"TotalTimeInBed_2" = sum(minutes_in_bed),
"TotalMinutesAwake" = sum(minutes_awake),
"TotalMinutesRestless" = sum(minutes_restless),
) %>%
mutate("Id_SleepDay_UID" = paste(Id, SleepDay, sep = "_")) %>%
arrange(Id_SleepDay_UID)
# Compare both versions of the data and return any dates with different values
sleepDay_comp <- sleep_sum_days_wide %>%
# mutate("SleepDay_typed" = mdy_hms(SleepDay)) %>%
mutate("SleepDay_floored" = floor_date(mdy_hms(SleepDay), unit = "days")) %>%
mutate("Id_SleepDay_UID" = paste(Id, SleepDay_floored, sep = "_")) %>%
arrange(Id_SleepDay_UID) %>%
merge(sleep_sum_days_wide_test, by = "Id_SleepDay_UID", all = TRUE) %>%
mutate(recordDiff = TotalSleepRecords_2 - TotalSleepRecords) %>%
mutate(sleepDiff = TotalMinutesAsleep_2 - TotalMinutesAsleep) %>%
mutate(bedDiff = TotalTimeInBed_2 - TotalTimeInBed) %>%
select(
Id_SleepDay_UID,
recordDiff,
sleepDiff,
bedDiff
) %>%
filter(!(recordDiff == 0 & sleepDiff == 0 & bedDiff == 0)) %>%
arrange(Id_SleepDay_UID)
cat("Printing all rows with non-zero diff values:\n")
glimpse(sleepDay_comp)
rm(sleep_sum_days_wide_test, sleepDay_comp)
prep_apdx_labs <- grep("^prep-", knitr::all_labels(), value = TRUE)
invisible(lapply(prep_apdx_labs, print_chunk_with_label))
# Remove all NULL and empty-string values
cat("Checking for NULL/empty values...\n", sep="")
for (df_name in df_names) {
df <- get(df_name)
num_nulls <- sum(is.na(df))
if(num_nulls) {
cat(df_name,": ",num_nulls," NULLs","\n",sep="")
}
# Check for empty strings (which do not show up as NULLs)
empty_strings <- df %>%
filter(if_any(where(is.character), ~ nchar(.) == 0))
num_empty_strings = nrow(empty_strings)
if(num_empty_strings) {
cat(df_name,": ",num_empty_strings," empty strings","\n",sep="")
}
}
cat("Checking for NULL/empty values done.\n", sep="")
rm(df, df_name, num_nulls, empty_strings, num_empty_strings)
# Declare all functions and global variables required for cleaning
rename_df_variables <- function(df_name, var_mods) {
cat("Renaming variables in \"",df_name,"\"...\n", sep = "")
df <- get(df_name)
# Check each var name requiring correction against the var names in the df
for (i in 1:nrow(var_mods)) {
var_old = var_mods$var_old[i]
if (!(var_old %in% colnames(df))) {
next
}
# If found, make sure the conversion is applicable to this or all dfs
tbl <- var_mods$tbl[i]
if (tbl != "" && tbl != df_name) {
next
}
# Perform the conversion if all checks passed
var_new = var_mods$var_new[i]
cat("df: ",df_name, "\tvar_old: ",var_old,"\t",sep="")
cat("var_new: ",var_new,"\t", sep="")
cat("tbl: ",tbl," ", sep="")
cat("Replacing... ", sep = "")
df <- df %>% rename(!!var_new := !!var_old)
cat("Done.\n", sep = "")
}
rm(i)
cat("Renaming variables in \"",df_name,"\" complete.\n", sep = "")
return(df)
}
var_mods <- data.frame(
var_old = character(0),
var_new = character(0),
type_new = character(0),
tbl = character(0)
)
# WARNING: Ensure table-specific modifications (tbl != "") are positioned above non-specific modifications with matching var_old/var_new values: only the first modification in the list will be applied to matching variables.
var_mods <- var_mods %>%
rbind(.,data.frame(var_old="date", var_new="bodycomp_datetime", type_new="POSIXct", tbl="bodycomp_src_logs_wide")) %>%
rbind(.,data.frame(var_old="time", var_new="heart_rate_second", type_new="POSIXct", tbl="heartrate_src_seconds_tall")) %>%
rbind(.,data.frame(var_old="value", var_new="heart_rate", type_new="", tbl="heartrate_src_seconds_tall")) %>%
rbind(.,data.frame(var_old="date", var_new="sleep_minute", type_new="POSIXct", tbl="sleep_src_mins_tall")) %>%
rbind(.,data.frame(var_old="value", var_new="sleep_rank", type_new="", tbl="sleep_src_mins_tall")) %>%
rbind(.,data.frame(var_old="", var_new="activity_hour", type_new="POSIXct", tbl="")) %>%
rbind(.,data.frame(var_old="", var_new="activity_minute", type_new="POSIXct", tbl="")) %>%
rbind(.,data.frame(var_old="", var_new="sleep_day", type_new="POSIXct", tbl="")) %>%
rbind(.,data.frame(var_old="", var_new="id", type_new="character", tbl="")) %>%
rbind(.,data.frame(var_old="log_id", var_new="sleep_log_id", type_new="character", tbl="sleep_src_mins_tall")) %>%
rbind(.,data.frame(var_old="log_id", var_new="bodycomp_log_id", type_new="character", tbl="bodycomp_src_logs_wide")) %>%
rbind(.,data.frame(var_old="activity_date", var_new="activity_day", type_new="Date", tbl="")) %>%
rbind(.,data.frame(var_old="fairly_active_distance", var_new="distance_fairly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="fairly_active_minutes", var_new="minutes_fairly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="light_active_distance", var_new="distance_lightly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="light_active_minutes", var_new="minutes_lightly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="lightly_active_distance", var_new="distance_lightly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="lightly_active_minutes", var_new="minutes_lightly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="logged_activities_distance", var_new="distance_logged_activities", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="me_ts", var_new="mets", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="moderately_active_distance", var_new="distance_moderately_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="sedentary_active_distance", var_new="distance_sedentary", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="sedentary_distance", var_new="distance_sedentary", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="sedentary_active_minutes", var_new="minutes_sedentary", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="sedentary_minutes", var_new="minutes_sedentary", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="step_total", var_new="steps_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_distance", var_new="distance_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_intensity", var_new="intensity_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_minutes_asleep", var_new="minutes_asleep_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_sleep_records", var_new="sleep_records_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_steps", var_new="steps_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_time_in_bed", var_new="minutes_in_bed_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="tracker_distance", var_new="distance_tracker", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="very_active_distance", var_new="distance_very_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="very_active_minutes", var_new="minutes_very_active", type_new="", tbl=""))
# Clean and update all variables in data set
cat("Cleaning variable names...\n", sep = "")
for(df_name in df_names) {
cat("Cleaning ",df_name,"...\n", sep = "")
assign(df_name, get(df_name) %>% clean_names())
}
cat("Cleaning variable names complete.\n", sep = "")
cat("Renaming variables...\n", sep = "")
var_mods_rename <- var_mods %>%
filter(var_old != "" & var_new != "")
for(df_name in df_names) {
assign(df_name, rename_df_variables(df_name, var_mods_rename))
}
rm(df_name)
cat("Renaming variables complete.\n", sep = "")
# Remove all duplicate data in data set
# This function was tested using a prototype version that generated dataframes of all of the apparent duplicate rows: these were verified manually before the function was allowed to modify the actual dataframes. The logic was verified again by re-running it after the initial removal: all dataframes reported zero duplicates on the second run, confirming the success of the first pass.
cat("Checking for duplicate values...\n",sep="")
for (df_name in df_names) {
cat("Checking ",df_name," for duplicates... ",sep="")
df <- get(df_name)
if (!anyDuplicated(df)) {
cat("0 duplicates removed. Done.\n",sep="")
} else {
nrow_before <- nrow(df)
df <- distinct(df)
nrow_after <- nrow(df)
cat("Removing ",(nrow_before - nrow_after)," duplicates... ",sep="")
assign(df_name, df)
cat("Done.\n",sep="")
}
}
rm(df, df_name, nrow_before, nrow_after)
cat("Checking for duplicate values complete.\n",sep="")
# Update non-factor variables
# Declare required functions and global variables ----
var_mods_recast <- var_mods %>%
filter(type_new != "")
are_identical_lists <- function(list1, list2) {
if (length(list1) != length(list2)) {
return(FALSE)
}
for (i in seq_along(list1)) {
if(!identical(list1[[i]], list2[[i]])) {
cat("Non-identical lists at list1[",i,"]. Exiting.\n", sep = "")
print(list1[[i]])
print(list2[[i]])
rm(i)
return(FALSE)
}
}
rm(i)
return(TRUE)
}
get_df_var_types <- function(df_name) {
cat("Getting current variable types for ", df_name, "... ", sep = "")
df <- get(df_name)
var_types <- data.frame(
var = names(df),
type = sapply(df, function(col) class(col)[1])
)
cat("Done.\n", sep = "")
return(var_types)
}
get_df_target_var_types <- function(df_name) {
cat("Getting target variable types for ", df_name, "...\n", sep = "")
var_types <- get_df_var_types(df_name)
# Iterate over rows in current var_types
for (var_row in 1:nrow(var_types)) {
# Check if var name is in var_mods_recast
var_name <- var_types$var[var_row]
for(mods_row in 1:nrow(var_mods_recast)) {
var_name_mods <- var_mods_recast$var_new[mods_row]
# If no, skip: if yes, replace type with target
if(var_name == var_name_mods) {
type_new <- var_mods_recast$type_new[mods_row]
cat("Updated \"", var_name, "\" from ", var_types$type[var_row], " to ", var_mods_recast$type_new[mods_row], ".\n", sep = "")
var_types$type[var_row] <- type_new
}
}
}
cat("Getting target variable types for ", df_name, " done.\n", sep = "")
return(var_types)
}
recast_variables <- function(df_name) {
df <- get(df_name)
for (i in 1:nrow(var_mods_recast)) {
var_new <- var_mods_recast$var_new[i]
type_new <- var_mods_recast$type_new[i]
if (var_new %in% colnames(df)) {
cat("Converting ",df_name,"$",var_new," to ",type_new, "... ", sep = "")
if (type_new == "character") {
df <- df %>% mutate("{var_new}" := as.character(!!sym(var_new)))
} else if (type_new == "Date") {
df <- df %>% mutate("{var_new}" := mdy(!!sym(var_new)))
} else if (type_new == "POSIXct") {
df <- df %>% mutate("{var_new}" := mdy_hms(!!sym(var_new)))
} else {
cat("type_new not found: not converting.", sep = "")
}
cat("Done.\n", sep = "")
}
}
rm(i)
return(df)
}
# Recast variables ----
cat("Generating list of target column types for testing...\n", sep = "")
df_types_target <- lapply(df_names, get_df_target_var_types)
names(df_types_target) <- df_names
cat("Generating list of target column types complete.\n", sep = "")
cat("Recasting variables...\n", sep = "")
for (df_name in df_names) {
cat("Recasting ",df_name,"...\n", sep = "")
assign(df_name, recast_variables(df_name))
cat("Converting ",df_name," complete.\n", sep = "")
}
cat("Recasting variables complete.\n", sep = "")
# Test recasting of variables ----
cat("Generating list of updated column types...\n", sep = "")
df_types_after <- lapply(df_names, get_df_var_types)
names(df_types_after) <- df_names
cat("Generating list of updated column types complete.\n", sep = "")
cat("Checking updated column types against target types...\n", sep = "")
test_succeeded <- are_identical_lists(df_types_after, df_types_target)
cat("Checking updated column types against target types complete.\n", sep = "")
cat("Data recasting ", case_when(test_succeeded ~ "succeeded", TRUE ~"failed"), ".", sep = "")
rm(df_name, df_types_target, df_types_after, test_succeeded)
# Recast unique factor variables individually
sleep_src_mins_tall <- sleep_src_mins_tall %>%
mutate(sleep_rank = factor(sleep_rank, levels = 1:3, labels = c("Asleep", "Restless", "Awake")))
intensity_src_mins_tall <- intensity_src_mins_tall %>%
mutate(intensity = factor(intensity, levels = 0:3, labels = c("Sedentary", "Lightly Active", "Fairly Active", "Very Active")))
# Trim variable names
cat("Trimming whitespace in variable names...\n", sep="")
for (df_name in df_names) {
df <- get(df_name)
for (col_name in colnames(df)) {
col_name_trimmed <- str_trim(col_name)
if (col_name_trimmed != col_name) {
cat("Trimming ",df_name, "[",col_name,"] to ",col_name_trimmed,"... ",sep="")
# rename() takes new_name = old_name
df <- df %>% rename(!!col_name_trimmed := !!col_name)
cat("Done.\n", sep = "")
}
}
# Write the modified data frame back to the environment
assign(df_name, df)
}
rm(df, df_name, col_name, col_name_trimmed)
cat("Trimming whitespace in variable names complete.\n", sep="")
# Trim data
trim_chr_column <- function(col) {
if (is.character(col)) {
return(str_trim(col))
} else {
return(col)
}
}
cat("Trimming whitespace in character-type data values...\n", sep="")
for (df_name in df_names) {
df <- get(df_name)
for (col_name in colnames(df)) {
if (is.character(df[[col_name]])) {
cat("Trimming ",df_name, "[",col_name,"]... ",sep="")
df <- df %>% mutate(across(all_of(col_name), str_trim))
cat("Done.\n", sep = "")
}
}
# Write the modified data frame back to the environment
assign(df_name, df)
}
rm(df, df_name, col_name)
cat("Trimming whitespace in character-type data values complete.\n", sep="")
# Validate ID lengths
validate_string_length <- function(df_name, col_name, valid_length) {
df <- get(df_name)
# Skip data frames that do not contain the column of interest
if (!(col_name %in% colnames(df))) {
return(invisible(NULL))
}
cat("Checking ",df_name," for invalid ",col_name," values... ",sep="")
invalid_values <- df %>%
select(!!sym(col_name)) %>%
filter(nchar(!!sym(col_name)) != valid_length)
if (nrow(invalid_values) > 0) {
cat(nrow(invalid_values)," invalid values found:\n",sep="")
glimpse(invalid_values)
} else {
cat("complete.\n",sep="")
}
}
result <- map(df_names, ~validate_string_length(.x, col_name = "id", valid_length = 10))
result <- map(df_names, ~validate_string_length(.x, col_name = "sleep_log_id", valid_length = 11))
result <- map(df_names, ~validate_string_length(.x, col_name = "bodycomp_log_id", valid_length = 13))
rm(result)
# Validate ID ranges
cat("Checking min/max ID value ranges...\n",sep="")
id_col_names <- c("id", "sleep_log_id", "bodycomp_log_id")
for (df_name in df_names) {
df <- get(df_name)
for (col_name in id_col_names) {
if(col_name %in% colnames(df)) {
cat(df_name,"[",col_name,"] Min = ",min(df[[col_name]])," Max = ",max(df[[col_name]]),".\n",sep="")
}
}
}
rm(df, df_name, col_name, id_col_names)
cat("Checking min/max ID value ranges complete.\n",sep="")
# Validate date data
validate_dates <- function(df_name) {
valid_dates_stt <- as.Date("2016-03-12", format = "%Y-%m-%d")
valid_dates_end <- as.Date("2016-05-14", format = "%Y-%m-%d")
cat("Validating dates for ",df_name,"... ",sep="")
date_columns <- get(df_name) %>%
select_if(function(col) is.POSIXct(col) || is.Date(col))
if (ncol(date_columns) == 0) {
cat("Error: no date column found.\n",sep="")
} else if (ncol(date_columns) > 1) {
cat("Error: more than one date column found for ",df_name,".\n",sep="")
} else {
outside_range <- date_columns %>%
filter(if_any(everything(), ~ . < valid_dates_stt | . > valid_dates_end))
if (nrow(outside_range) > 0) {
cat("found ",nrow(outside_range)," invalid values:\n",sep="")
print(outside_range)
rm(outside_range)
} else {
cat("complete.\n")
}
}
rm(date_columns)
}
result <- lapply(df_names, validate_dates)
rm(result)
# Validate all numeric data
validate_numerics <- function(df_name) {
# This function performs all checks on numeric values that are required in more than one data-frame, e.g. non-negativity and summation
cat("Validating numerics for ",df_name,"... ",sep="")
numerics <- get(df_name) %>% select_if(is.numeric)
if (ncol(numerics) == 0) {
cat("No numeric variables found.\n",sep="")
} else {
# Check for negative values
negative_values <- numerics %>%
filter(if_any(everything(), ~ . < 0))
if (nrow(negative_values) > 0) {
cat("found ",nrow(negative_values)," invalid values:\n",sep="")
print(negative_values)
} else {
cat("complete.\n")
}
rm(negative_values)
# Check for summation
}
rm(numerics)
}
result <- lapply(df_names, validate_numerics)
rm(result)
# Perform additional validation on specific numeric data
# Note: The maximum values given are used as thresholds above which values may not be realistic, not as hard limits for acceptability: a heart-rate of 200bpm, for example, is entirely possible, but it is high enough that I would want to check if the data point corresponded to a period of high-intensity exercise.
# For a given list of dfs, column names, and a min-max range, check all matching columns in all matching dfs against that range
validate_within_range <- function(df_name, column_names, range_min, range_max) {
df <- get(df_name)
for (col_name in column_names) {
cat("Checking ranges for ",df_name,"[",col_name,"]... ",sep="")
if (!(col_name %in% colnames(df))) {
cat("column not found.\n")
} else {
out_of_range <- df %>%
filter(!!sym(col_name) < range_min | !!sym(col_name) > range_max)
if (nrow(out_of_range) <= 0) {
cat("complete.\n",sep="")
} else {
cat("Found ",nrow(out_of_range)," out-of-range values:\n",sep="")
glimpse(out_of_range)
}
rm(out_of_range)
}
}
rm(df, col_name)
}
# Minute summation
valid_df_names <- c(
"activity_sum_days_wide",
"intensity_sum_days_wide",
"sleep_sum_days_wide")
valid_col_names <- c(
"minutes_very_active",
"minutes_fairly_active",
"minutes_lightly_active",
"minutes_sedentary",
"minutes_asleep_total",
"minutes_in_bed_total")
range_max <- 60 * 12 # Minutes in half a day
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Weights (kg)
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("weight_kg")
range_max <- 150 # Arbitrarily chosen as high enough to be a potentially erroneous value
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Weights (pounds, same limit as for weight in kilos)
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("weight_pounds")
kg2lb <- 2.204623
range_max <- 150 * kg2lb # Arbitrarily chosen as high enough to be a potentially erroneous value
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
rm(kg2lb)
# BMI
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("bmi")
range_max <- 40.0 # Corresponds with the WHO "Obese (Class III)" weight category
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Distances
valid_df_names <- c(
"activity_sum_days_wide",
"intensity_sum_days_wide")
valid_col_names <- c(
"distance_lightly_active",
"distance_logged_activities",
"distance_moderately_active",
"distance_sedentary",
"distance_total",
"distance_tracker",
"distance_very_active")
range_max <- 21.08241 # Equivalent to one half-marathon
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Step counts
valid_df_names <- c(
"activity_sum_days_wide",
"steps_sum_days_tall",
"steps_sum_hours_tall",
"steps_src_mins_tall")
valid_col_names <- c(
"steps",
"steps_total")
range_max <- 14800 # Set to double the average daily step count for Australian adults
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Calories
valid_df_names <- c(
"activity_sum_days_wide",
"calories_src_mins_tall",
"calories_sum_days_tall",
"calories_sum_hours_tall")
valid_col_names <- c("calories")
range_max <- 4000 # Chosen arbitrarily as double the typically-recommended daily caloric intake
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Heart Rates
valid_df_names <- c("heartrate_src_seconds_tall")
valid_col_names <- c("heart_rate")
range_max <- 200 # Chosen based on the estimated maximum (220 minus age) heart rate for a 20 y.o. (Source: American Heart Association)
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
rm(range_max, result, valid_df_names, valid_col_names)
clean_apdx_labs <- grep("^clean-", knitr::all_labels(), value = TRUE)
# Declare generic functions for use in the analysis below
round_data_to_bin <- function(data, bin_width) {
rounding_factor <- 1 / bin_width
return(round(data * rounding_factor) / rounding_factor)
}
get_histogram_max_count <- function(data, bin_width) {
rounding_factor <- 1 / bin_width
# Round data to nearest multiple of bin_width
data_rounded <- round_data_to_bin(data, bin_width)
max_count <- max(table(data_rounded))
return(max_count)
}
rescale_plot <- function(p, x_min = 0, x_max = 10, x_step = 1, y_min = 0, y_max = 10, y_step = 1) {
p <- p +
scale_x_continuous(breaks = seq(x_min, x_max, by=x_step),
labels = scales::comma_format(),
limits=c(x_min, x_max)) +
scale_y_continuous(breaks = seq(y_min, y_max, by=y_step),
labels = scales::comma_format(),
limits=c(y_min, y_max))
return(p)
}
plot_histo_pareto <- function(data, bin_width) {
# Cumulative data uses same bins as histogram
data <- sort(data)
data_rounded <- round_data_to_bin(data, bin_width)
# Calculate highest count so axes can be adjusted
highest_count <- get_histogram_max_count(data, bin_width)
data_pareto <- seq(1, length(data), by = 1) / length(data) * highest_count
# Calculate stats for overlay on histogram
data_max <- max(data)
data_mean <- round(mean(data), digits = 2)
data_median <- round(median(data), digits = 2)
sum_stats <- data.frame(Statistics = c("Mean", "Median"),
value = c(data_mean, data_median))
label_text <- paste("Mean =", data_mean, ", Median =", data_median)
p <- ggplot() +
geom_histogram(aes(x = data),
color = "white",
binwidth = bin_width) +
geom_line(aes(x = data_rounded,
y = data_pareto),
color="forestgreen") +
geom_vline(data = sum_stats,
aes(xintercept = value,
linetype = Statistics,
color = Statistics),
size = 1) +
scale_x_continuous(breaks = seq(0, data_max + bin_width, by = bin_width)) +
    scale_y_continuous(name = "Users",
                       breaks = seq(0, highest_count, by = 1),
                       sec.axis = sec_axis(~ . / highest_count,
                                           name = "Cumulative Percentage of Users",
                                           labels = scales::percent_format())) +
labs(x = "Value",
caption = label_text)
return(p)
}
get_time_coefficients <- function(ids, timestamps, values) {
tbl <- tibble(
id = ids,
timestamp = timestamps,
value = values
)
coeffs <- tbl %>%
mutate(day_of_year = yday(timestamp)) %>%
group_by(id) %>%
summarize(correlation = cor(day_of_year, value))
return(coeffs)
}
plot_time_coefficients <- function(coeffs) {
bin_width <- 0.1
max_count <- get_histogram_max_count(coeffs$correlation, bin_width)
coeffs_mean <- round(mean(coeffs$correlation, na.rm = TRUE), digits = 2)
coeffs_median <- round(median(coeffs$correlation, na.rm = TRUE), digits = 2)
sum_stats <- data.frame(Statistics = c("Mean", "Median"),
value = c(coeffs_mean, coeffs_median))
label_text <- paste("Mean =", coeffs_mean, "\nMedian =", coeffs_median)
ggplot(coeffs, aes(x = correlation)) +
geom_histogram(binwidth = bin_width, color = "white") +
geom_vline(data = sum_stats,
aes(xintercept = value,
linetype = Statistics,
color = Statistics),
               linewidth = 1) +
scale_x_continuous(breaks = seq(-1.0, 1.0, by=bin_width)) +
scale_y_continuous(breaks = seq(0, max_count, by=1)) +
annotate("text", x = Inf, y = Inf, label = label_text,
hjust = 1, vjust = 1, size = 4) +
labs(title = "Correlation Coefficient with Time",
caption = "Positive values imply variable of interest generally increased over time.",
x = "Correlation Coefficient",
y = "Count")
}
get_time_coefficients_plot <- function(ids, timestamps, values) {
coeffs <- get_time_coefficients(ids, timestamps, values)
p <- plot_time_coefficients(coeffs)
return(p)
}
wide_to_stacked_bar_plot <- function(data_wide, key, value, key_order) {
  if (!("id" %in% colnames(data_wide))) {
    stop("data does not include an \"id\" column: cannot convert.")
  }
  # Convert the data from wide to long. Set the factor levels to control the stacking order of the bars
  data_long <- data_wide %>%
    tidyr::gather(key = !!sym(key), value = !!sym(value), -id) %>%
    mutate(!!sym(key) := factor(!!sym(key), levels = key_order))
  # Order IDs in wide data based on value of first key, then rearrange long data.
  # This ensures the resultant plot sorts the IDs in ascending order of the first key
  first_key <- key_order[1]
  data_wide <- data_wide %>% arrange(!!sym(first_key))
  data_long$id <- factor(data_long$id, levels = data_wide$id)
  # Plot graph with angled ID labels
  p <- ggplot(data_long, aes(x = id, y = !!sym(value), fill = !!sym(key))) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  return(p)
}
scatter_with_LOBF <- function(data_x, data_y, labels = NULL) {
data <- tibble(
x = data_x,
y = data_y
)
p <- ggplot(data,
aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE,
color = "blue") +
    stat_cor(mapping = aes(label = after_stat(rr.label)),
method="pearson",
label.x=-Inf,
label.y=Inf,
hjust = -0.1,
vjust = 1.1) +
    # annotate() draws the label once, rather than once per data row as geom_text(aes(...)) would
    annotate("text",
             x = -Inf,
             y = Inf,
             hjust = -0.05,
             vjust = 3.5,
             label = sprintf("Gradient: %.2f", coef(lm(y ~ x, data = data))[2]))
  if (!is.null(labels) && length(labels) >= 2) {
p <- p + labs(title = paste(labels[1], "vs.", labels[2]),
x = labels[1],
y = labels[2])
}
return(p)
}
# For each ID, generate average daily hours for each of the four intensity levels
mean_daily_intensities_wide <- intensity_sum_days_wide %>%
group_by(id) %>%
summarize("sedentary" = mean(minutes_sedentary) / 60,
"lightly_active" = mean(minutes_lightly_active) / 60,
"fairly_active" = mean(minutes_fairly_active) / 60,
"very_active" = mean(minutes_very_active) / 60)
# Convert the data to long format for plotting as a stacked-bar chart
intensity_order <- c("sedentary", "lightly_active", "fairly_active", "very_active")
mean_daily_intensities_long <- mean_daily_intensities_wide %>%
tidyr::gather(key = "intensity", value = "mean_hours", -id) %>%
mutate(intensity = factor(intensity, levels = intensity_order))
# Reorder IDs by total active time (fairly active plus very active)
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
mutate("total_active" = fairly_active + very_active) %>%
arrange(total_active)
# Apply order of IDs to long-format data to force bar stacking order
mean_daily_intensities_long$id <- factor(
mean_daily_intensities_long$id,
levels = mean_daily_intensities_wide$id)
viz_avg_time_by_intensity <- ggplot(mean_daily_intensities_long, aes(x= id, y= mean_hours, fill= intensity)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(breaks = seq(0, 24, by = 1)) +
labs(title="Average Daily Time Logged by Intensity Zone",
x = "User ID",
y = "Total Average Daily Time Logged (hours)")
print(viz_avg_time_by_intensity)
viz_active_hrs_per_week <- plot_histo_pareto(mean_daily_intensities_wide$total_active, bin_width = 0.25) +
  labs(title = "Average Daily Active Hours",
       x = "Hours")
print(viz_active_hrs_per_week)
# For each ID, generate proportion Very Active time
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
mutate(proportion_very_active = ifelse(total_active > 0,
very_active / total_active,
0))
# Visualise proportion Very Active time
viz_very_active_proportion <- plot_histo_pareto(mean_daily_intensities_wide$proportion_very_active,
bin_width = 0.1) +
  labs(title = "Proportion of Active Time Spent Very Active",
       x = "Proportion Very Active")
print(viz_very_active_proportion)
# Subsets of the cohort with the lower or upper quintile of total active time removed
four_quintiles <- floor(nrow(mean_daily_intensities_wide) * 0.8)
mean_daily_intensities_wide_no_lower_quint <- mean_daily_intensities_wide %>%
  arrange(desc(total_active)) %>%
  head(four_quintiles)
mean_daily_intensities_wide_no_upper_quint <- mean_daily_intensities_wide %>%
  arrange(total_active) %>%
  head(four_quintiles)
viz_very_active_vs_total_active <- ggplot(mean_daily_intensities_wide, aes(x=total_active, y=proportion_very_active)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE,
color = "blue") +
  stat_cor(aes(label = after_stat(rr.label)),
label.x=Inf,
hjust = 1,
label.y= Inf,
vjust = 1) +
scale_x_continuous(limits=c(0,2)) +
scale_y_continuous(limits=c(0,1)) +
  labs(title = "Proportion Very Active vs. Total Active Time",
       x = "Average Daily Active Time (hours)",
       y = "Proportion of Very Active Time")
print(viz_very_active_vs_total_active)
four_quintiles_length <- floor(nrow(mean_daily_intensities_wide)*0.8)
viz_very_active_vs_total_active_no_upper_quint <- ggplot(
data = mean_daily_intensities_wide %>%
arrange(total_active) %>%
head(four_quintiles_length),
aes(x=total_active, y=proportion_very_active)
) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE,
color = "blue") +
  stat_cor(aes(label = after_stat(rr.label)),
label.x=Inf,
hjust = 1,
label.y= Inf,
vjust = 1) +
scale_x_continuous(limits=c(0,2)) +
scale_y_continuous(limits=c(0,1)) +
  labs(title = "Proportion Very Active vs. Total Active Time",
       subtitle = "Upper Quintile Removed",
       x = "Average Daily Active Time (hours)",
       y = "Proportion of Very Active Time")
print(viz_very_active_vs_total_active_no_upper_quint)
# For each intensity level, obtain time coefficients per ID
lightly_active_time_coeffs <- get_time_coefficients(
activity_sum_days_wide$id,
activity_sum_days_wide$activity_day,
activity_sum_days_wide$minutes_lightly_active
) %>%
mutate(intensity = "lightly_active")
fairly_active_time_coeffs <- get_time_coefficients(
activity_sum_days_wide$id,
activity_sum_days_wide$activity_day,
activity_sum_days_wide$minutes_fairly_active
) %>%
mutate(intensity = "fairly_active")
very_active_time_coeffs <- get_time_coefficients(
activity_sum_days_wide$id,
activity_sum_days_wide$activity_day,
activity_sum_days_wide$minutes_very_active
) %>%
mutate(intensity = "very_active")
# Bind coefficients into long-format data frame for visualisation
intensity_time_coeffs <- rbind(
lightly_active_time_coeffs,
fairly_active_time_coeffs,
very_active_time_coeffs
)
viz_all_intensity_over_time <- ggplot(intensity_time_coeffs) +
theme_minimal() +
geom_freqpoly(aes(x = correlation,
color = intensity),
binwidth = 0.1,
                linewidth = 1) +
scale_x_continuous(breaks = seq(-1, 1, by = 0.1)) +
scale_y_continuous(breaks = seq(0, 10, by = 1)) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
labs(title = "Correlation Coefficients for Active Hours vs. Time",
subtitle = "Grouped by Intensity",
x = "Correlation with Time",
y = "Users")
print(viz_all_intensity_over_time)
# For each day, calculate the proportion of active time that was very active
daily_intensity_proportions <- intensity_sum_days_wide %>%
mutate(minutes_active = minutes_fairly_active + minutes_very_active,
proportion_very_active = ifelse(minutes_active > 0,
minutes_very_active / minutes_active,
0))
viz_very_active_prop_time_coeffs <- get_time_coefficients_plot(
  daily_intensity_proportions$id,
  daily_intensity_proportions$activity_day,
  daily_intensity_proportions$proportion_very_active
) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  labs(title = "Correlation Coefficients for Proportion of Very Active Time",
       x = "Correlation with Time",
       y = "Users")
print(viz_very_active_prop_time_coeffs)
# Step-cadence thresholds, in steps per minute
brisk_walk_thld <- 80
running_thld <- 110
mean_movement_times_daily <- steps_src_mins_tall %>%
mutate(activity_day = as.POSIXct(format(as.Date(activity_minute)))) %>%
  group_by(id, activity_day) %>%
  # Each row is one minute of data, so dividing minute counts by 60 gives hours
  summarize(not_walking = sum(steps == 0) / 60,
moderate_walking = sum(steps > 0 & steps <= brisk_walk_thld) / 60,
brisk_walking = sum(steps > brisk_walk_thld & steps <= running_thld) / 60,
running = sum(steps > running_thld) / 60)
mean_movement_times <- mean_movement_times_daily %>%
group_by(id) %>%
summarize(not_walking = mean(not_walking),
moderate_walking = mean(moderate_walking),
brisk_walking = mean(brisk_walking),
running = mean(running)) %>%
arrange(running)
speed_order <- c("not_walking", "moderate_walking", "brisk_walking", "running")
mean_movement_times_long <- mean_movement_times %>%
tidyr::gather(key = "speed", value = "mean_hours", -id) %>%
mutate(speed = factor(speed, levels = speed_order))
mean_movement_times_long$id <- factor(
mean_movement_times_long$id,
levels = mean_movement_times$id
)
ggplot(mean_movement_times_long,
aes(x = id, y = mean_hours, fill = speed)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 24, by = 1)) +
labs(title = "Average Daily Movement Speeds by User ID",
x = "User ID",
y = "Average Daily Hours")
viz_walking_speed_frequency <- ggplot(mean_movement_times_long %>% filter(speed != "not_walking")) +
theme_pubclean() +
geom_histogram(aes(x = mean_hours,
fill = speed),
binwidth = 0.25,
position = "dodge",
color = "black") +
scale_x_continuous(breaks = seq(0, 5, by = 0.25)) +
# theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Frequency of Average Walking Hours by Walking Speed",
x = "Mean Hours",
y = "Users")
print(viz_walking_speed_frequency)
# Calculate "Active Walking" times for each user
mean_movement_times <- mean_movement_times %>%
mutate(total_active_walking = brisk_walking + running) %>%
arrange(total_active_walking)
# Plot histogram "Active Walking" times for each user
viz_active_walking <- plot_histo_pareto(mean_movement_times$total_active_walking, bin_width = 0.25) +
labs(title = "Average Daily Active-Walking Hours",
x = "Average Daily Hours")
print(viz_active_walking)
# For each ID, generate proportion Running time
mean_movement_times <- mean_movement_times %>%
mutate(proportion_running = ifelse(total_active_walking > 0,
running / total_active_walking,
0))
viz_running_proportion <- plot_histo_pareto(mean_movement_times$proportion_running,
bin_width = 0.1) +
  labs(title = "Proportion of Active-Walking Time Spent Running",
       x = "Proportion Running")
print(viz_running_proportion)
mean_daily_steps <- steps_sum_days_tall %>%
group_by(id) %>%
summarize(mean_steps = mean(steps_total))
distance_vs_active <- activity_sum_days_wide %>%
select(id,
activity_day,
distance_moderately_active,
distance_very_active,
minutes_fairly_active,
minutes_very_active) %>%
mutate(distance_active = distance_moderately_active + distance_very_active,
minutes_active = minutes_fairly_active + minutes_very_active) %>%
group_by(id) %>%
summarize(mean_distance_active = mean(distance_active),
mean_minutes_active = mean(minutes_active))
# Analyse overnight usage using sleep log data
# A night counts as logged if more than this many hours were recorded
hours_asleep <- 6
# Total number of nights in the analysis period
total_nights <- 33
nights_logged_by_sleep_log <- sleep_sum_days_wide %>%
group_by(id) %>%
summarize(count_logged_asleep = sum(minutes_asleep_total / 60 > hours_asleep),
count_logged_in_bed = sum(minutes_in_bed_total / 60 > hours_asleep),
pct_logged_asleep = count_logged_asleep / total_nights,
pct_logged_in_bed = count_logged_in_bed / total_nights)
plot_histo_pareto(nights_logged_by_sleep_log$pct_logged_in_bed, bin_width = 0.05) +
labs(title = "Percentage of Nights Logged In Bed - Logged Users Only",
subtitle = "Source: Sleep Log Data")
# Add missing ID values to the dataset to analyse the full cohort
missing_ids <- setdiff(unique(activity_sum_days_wide$id), nights_logged_by_sleep_log$id)
for (id in missing_ids) {
nights_logged_by_sleep_log <- nights_logged_by_sleep_log %>%
rbind(.,data.frame(
id = id,
count_logged_asleep = 0,
count_logged_in_bed = 0,
pct_logged_asleep = 0.0,
pct_logged_in_bed = 0.0))
}
# Visualise overnight usage using sleep log data
viz_night_usage_sleep_logs <- plot_histo_pareto(nights_logged_by_sleep_log$pct_logged_in_bed, bin_width = 0.05) +
labs(title = "Percentage of Nights Logged In Bed - All Users",
subtitle = "Source: Sleep Log Data")
print(viz_night_usage_sleep_logs)
# Analyse heartrate polling variance
heartrate_polling_variance <- heartrate_src_seconds_tall %>%
  # Sort and group by id so that lag() only compares timestamps from the same user
  arrange(id, heart_rate_second) %>%
  group_by(id) %>%
  mutate(polling_diff = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
  filter(polling_diff < 60)
# Visualise heart rate polling variance
ggplot(heartrate_polling_variance) +
geom_boxplot(aes(y = polling_diff)) +
scale_y_continuous(breaks = seq(0,60,by=5)) +
labs(title="Variation in Heart Rate Polling",
caption="Distribution shown for polling differences less than 60 seconds",
y="Time Between Polls (seconds)")
# Analyse overnight usage using heart rate data
# Find nights where enough heart-rate data was logged to imply the user slept with their FitBit on
sec2hour <- 1 / 3600
# Sleep range set to between 10pm and 6am
sleep_range_stt_hour <- 22
sleep_range_end_hour <- 6
# Anything more than this time between logs is considered a removal (typical poll rate is 1-20 seconds)
max_poll_gap_secs <- 20
# Minimum six hours must be logged to be counted
min_hours_logged <- 6
nights_total <- 33
nights_logged_by_heart_rate <- heartrate_src_seconds_tall %>%
# For logs in the AM, the night-of is set to the day before
mutate(night_of_yday = ifelse(hour(heart_rate_second) >= sleep_range_stt_hour, yday(heart_rate_second),
ifelse(hour(heart_rate_second) < sleep_range_end_hour, yday(heart_rate_second) - 1,
NA))) %>%
# Only consider logs from the "sleep" range
filter(!is.na(night_of_yday)) %>%
  # Sort and group by id so that difftime only compares consecutive timestamps from the same user
  arrange(id, heart_rate_second) %>%
  group_by(id) %>%
  mutate(time_since_last_poll = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
# Remove any logs that are followed by too long a gap
filter(time_since_last_poll <= max_poll_gap_secs) %>%
group_by(id, night_of_yday) %>%
# Find the total time logged as the sum of time between valid logs
summarize(logged_time_hours = sum(time_since_last_poll) * sec2hour) %>%
# Count up the nights where each user logged sufficient time to be considered asleep with their fitbit
group_by(id) %>%
summarize(nights_logged = sum(logged_time_hours > min_hours_logged),
nights_logged_pct = nights_logged / nights_total)
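# For scale: six hours of continuous logging at a typical 5-second poll interval
# corresponds to roughly 4,320 samples (6 * 3600 / 5)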
# Plot histogram showing who logged what fraction of their nights
viz_night_usage_heartrate <- plot_histo_pareto(nights_logged_by_heart_rate$nights_logged_pct, bin_width = 0.05) +
labs(title = "Percentage of Nights Logged In Bed - HR Users Only",
subtitle = "Source: Heart Rate Data")
print(viz_night_usage_heartrate)
# Visualise sleep logs vs heart rate logs
sleep_data_merged <- merge(nights_logged_by_sleep_log, nights_logged_by_heart_rate, by = "id", all = TRUE) %>%
select(id, pct_logged_asleep, nights_logged_pct) %>%
rename(sleep_logs = pct_logged_asleep,
heartrate_logs = nights_logged_pct)
sleep_data_merged <- tidyr::gather(sleep_data_merged, key = "Variable", value = "Value", -id)
viz_sleep_logs_vs_hr_logs <- ggplot(sleep_data_merged, aes(x = id, y = Value, fill = Variable)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7, color = "black") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
labs(title = "Nights Logged: Sleep Logs vs. Heartrate Logs",
x = "ID",
y = "% of Nights Logged",
fill = "Source")
print(viz_sleep_logs_vs_hr_logs)
# Analyse sleep quantity and quality
sleep_sum_quant_qual <- sleep_src_mins_tall %>%
  # Set the date for sleep logs in the AM to the day before, i.e. the "night of" date.
  # dplyr::if_else is used because, unlike base ifelse, it preserves the Date class
  mutate(sleep_date = if_else(hour(sleep_minute) > 12,
                              as.Date(sleep_minute),
                              as.Date(sleep_minute) - 1)) %>%
group_by(id, sleep_date) %>%
summarize(minutes_awake = sum(sleep_rank == "Awake"),
minutes_restless = sum(sleep_rank == "Restless"),
minutes_asleep = sum(sleep_rank == "Asleep"),
minutes_total = minutes_awake + minutes_restless + minutes_asleep,
pct_awake = minutes_awake / minutes_total,
pct_restless = minutes_restless / minutes_total,
pct_asleep = minutes_asleep / minutes_total)
sleep_sum_quant_qual_mean <- sleep_sum_quant_qual %>%
group_by(id) %>%
summarize(across(.cols = -sleep_date, \(x) mean(x, na.rm = TRUE), .names = "{.col}_mean")) %>%
mutate(hours_total_mean = minutes_total_mean / 60)
# Visualise sleep quantity
plot_histo_pareto(sleep_sum_quant_qual_mean$hours_total_mean, bin_width = 1) +
  labs(title = "Average Nightly Hours Logged",
       x = "Hours Logged")
# Visualise sleep quality
plot_histo_pareto(sleep_sum_quant_qual_mean$pct_asleep_mean, bin_width = 0.05) +
labs(title = "Percentage of Sleep Logs Spent Asleep",
       x = "Proportion of Time")
# Brute-force visualisation of overall sleep quality over time
ggplot(sleep_sum_quant_qual,
aes(x = sleep_date,
y = pct_asleep)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE,
color = "blue") +
  stat_cor(mapping = aes(label = after_stat(rr.label)),
method="pearson",
label.x=-Inf,
label.y=Inf,
hjust = -0.1,
vjust = 1.1) +
facet_wrap(vars(id))
# Preliminary finding from the brute-force plots above: most per-user trend lines are flat,
# with five or six sloping slightly downward
# Visualise changes in sleep quantity over time
get_time_coefficients_plot(sleep_sum_quant_qual$id,
sleep_sum_quant_qual$sleep_date,
sleep_sum_quant_qual$minutes_asleep) +
  labs(title = "Correlation Coefficients for Average Total Time Asleep vs. Time")
# Visualise changes in sleep quality over time
get_time_coefficients_plot(sleep_sum_quant_qual$id,
sleep_sum_quant_qual$sleep_date,
sleep_sum_quant_qual$pct_asleep) +
  labs(title = "Correlation Coefficients for Average Proportion of Time Asleep vs. Time")
# Find periods where enough heart-rate data was logged to imply the user wore their FitBit
sec2hour <- 1 / 3600
# Daytime period set to between 6am and 10pm
period_stt_hour <- 6
period_end_hour <- 22
# Typical poll rate is 1-20 seconds: increased to 60 to allow for device moving out of position during daytime movement
max_poll_gap_secs <- 60
# Minimum six hours must be logged to be counted
min_hours_logged <- 6
periods_total <- 33
logging_days <- TRUE
if (logging_days) {
periods_logged <- heartrate_src_seconds_tall %>%
mutate(yday = ifelse((hour(heart_rate_second) >= period_stt_hour & hour(heart_rate_second) < period_end_hour),
yday(heart_rate_second),
NA))
} else {
# For night logging, set yday to the day before, i.e. "Night of [yday]"
periods_logged <- heartrate_src_seconds_tall %>%
mutate(yday = ifelse(hour(heart_rate_second) >= period_stt_hour,
yday(heart_rate_second),
ifelse(hour(heart_rate_second) < period_end_hour,
yday(heart_rate_second) - 1,
NA)))
}
periods_logged <- periods_logged %>%
# Only consider logs within the period of interest
filter(!is.na(yday)) %>%
  # Sort and group by id so that difftime only compares consecutive timestamps from the same user
  arrange(id, heart_rate_second) %>%
  group_by(id) %>%
  mutate(time_since_last_poll = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
# Remove any logs that are followed by too long a gap
filter(time_since_last_poll <= max_poll_gap_secs) %>%
group_by(id, yday) %>%
# Find the total time logged as the sum of time between valid logs
summarize(logged_time_hours = sum(time_since_last_poll) * sec2hour) %>%
  # Count up the periods in which each user logged sufficient time to be considered wearing their FitBit
group_by(id) %>%
summarize(periods = sum(logged_time_hours > min_hours_logged),
periods_pct = periods / periods_total)
# Plot histogram showing what fraction of their periods each user logged
plot_histo_pareto(periods_logged$periods_pct, bin_width = 0.05) +
labs(title = "Percentage of Periods Logged",
subtitle = "Source: Heart Rate Data")
# Analyse intensity levels
# From the Charge HR manual:
# Sedentary = 0-50% Max HR
# Lightly Active = 50-70% Max HR
# Fairly Active = 70-85% Max HR
# Very Active = 85-100% Max HR
# Max HR = 220 - age
# Arbitrarily set median age at 30 for initial analysis
assumed_median_age <- 30
max_heart_rate <- 220 - assumed_median_age
hr_zone_1 <- 0.5 * max_heart_rate
hr_zone_2 <- 0.7 * max_heart_rate
hr_zone_3 <- 0.85 * max_heart_rate
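# With the assumed age of 30: max HR = 190 bpm, giving zone boundaries at
# 95 bpm (sedentary / lightly active), 133 bpm (lightly / fairly active),
# and 161.5 bpm (fairly / very active)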
hr_zone_data <- heartrate_src_seconds_tall %>%
  # case_when evaluates conditions in order, so each band is bounded by the previous threshold
  mutate(hr_zone = case_when(heart_rate < hr_zone_1 ~ "sedentary",
                             heart_rate < hr_zone_2 ~ "lightly_active",
                             heart_rate < hr_zone_3 ~ "fairly_active",
                             TRUE ~ "very_active")) %>%
group_by(id) %>%
summarize(sedentary = sum(hr_zone == "sedentary"),
lightly_active = sum(hr_zone == "lightly_active"),
fairly_active = sum(hr_zone == "fairly_active"),
very_active = sum(hr_zone == "very_active"))
# Convert the data to long format for plotting as a stacked-bar chart
hr_zone_order <- c("sedentary", "lightly_active", "fairly_active", "very_active")
hr_zone_data_long <- hr_zone_data %>%
tidyr::gather(key = "hr_zone", value = "count", -id) %>%
mutate(hr_zone = factor(hr_zone, levels = hr_zone_order))
# Visualise intensity levels analysis
ggplot(hr_zone_data_long, aes(x = id, y = count, fill = hr_zone)) +
  geom_bar(stat = "identity",
           position = "fill") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Proportion of HR Logs by HR Zone",
       x = "User ID",
       y = "Proportion of Logs")
analyse_apdx_labs <- grep("^analyse-", knitr::all_labels(), value = TRUE)